Skip to content

fix(client): self-healing for permanently stuck expired shape handles#4087

Merged
KyleAMathews merged 11 commits intomainfrom
pr-4085
Apr 10, 2026
Merged

fix(client): self-healing for permanently stuck expired shape handles#4087
KyleAMathews merged 11 commits intomainfrom
pr-4085

Conversation

@KyleAMathews
Copy link
Copy Markdown
Contributor

Summary

Expired shape handle entries in localStorage can get permanently stuck, preventing data from ever loading for affected shapes. This adds a self-healing retry mechanism that clears the poisoned entry and retries once, allowing automatic recovery even when a proxy strips cache-buster query parameters.

Based on #4085 by @evan-liveflow — refined with additional hardening from code review.

Root Cause

When a shape gets a 409 (handle rotation), the client stores the old handle in localStorage['electric_expired_shapes']. On future requests, if a response contains that handle, the client treats it as a stale cached response and retries up to 3 times with cache-buster params.

The problem: if a proxy (e.g., phoenix_sync) strips query parameters, the cache busters are ineffective. All 3 retries fail, FetchError(502) is thrown to onError, and if onError doesn't retry, the stream dies. The expired entry persists in localStorage, so the next session hits the same wall — permanently.

Since the server never reuses handles (now documented as SPEC.md S0), the expired entry becomes a false positive once the caching layer clears — but the client has no way to discover this.

Approach

After stale cache retries exhaust (3 attempts), the client now:

  1. Always clears the expired entry from localStorage — if cache busters didn't work, keeping the entry only poisons future sessions
  2. Attempts one self-healing retry — resets the stream and retries without the expired_handle param. Since handles are never reused, the fresh response will have a new handle and won't trigger stale detection
  3. Guards against infinite loops via #expiredShapeRecoveryKey (once per shape key, reset on up-to-date)
if (transition.exceededMaxRetries) {
  if (shapeKey) {
    expiredShapesCache.delete(shapeKey)       // always clear
    if (this.#expiredShapeRecoveryKey !== shapeKey) {
      this.#expiredShapeRecoveryKey = shapeKey // remember we tried
      this.#reset()                            // fresh start
      throw new StaleCacheError(...)           // caught internally → retry
    }
  }
  throw new FetchError(502, ...)               // truly give up
}

Key Invariants

  • S0: Server handles are unique and never reused (phash2 + microsecond timestamp, SQLite UNIQUE INDEX, ETS insert_new)
  • Self-healing fires at most once per shape per retry cycle (#expiredShapeRecoveryKey guard)
  • Guard resets on up-to-date, so long-lived streams can self-heal again if CDN misbehaves later
  • Expired entry is cleared on every exhaustion, regardless of whether self-healing fires

Non-goals

  • TTL on expired cache entries — the self-healing mechanism handles the failure mode without added complexity
  • Changing onError contract — the fix works regardless of what the user's onError callback does

Verification

cd packages/typescript-client
pnpm vitest run --config vitest.unit.config.ts
# 312 tests pass
pnpm exec tsc --noEmit
# Clean

Files changed

File Change
src/client.ts Self-healing logic in #onInitialResponse, recovery key cleared on up-to-date, updated catch block comment
test/expired-shapes-cache.test.ts Updated 2 existing tests for self-healing flow, added test for CDN-always-stale scenario
SPEC.md Added S0 (handle uniqueness guarantee), updated L3 loop-back entry and guard table
.changeset/fix-expired-shapes-self-healing.md Changeset for patch release

Based on #4085

evanob and others added 2 commits April 2, 2026 20:22
When stale cache retries exhaust (3 attempts), clear the expired entry
from localStorage and retry once without the expired_handle param.
Since handles are never reused (SPEC.md S0), the fresh response gets a
new handle and bypasses stale detection. This prevents shapes from being
permanently unloadable when a proxy strips cache-buster query params.

Also documents the server handle uniqueness guarantee (S0) in the spec,
updates the loop-back table for the new self-healing path, and resets
the recovery guard on up-to-date so self-healing remains available for
long-lived streams.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@netlify
Copy link
Copy Markdown

netlify bot commented Apr 3, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 19efb39
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/69d92c95e16c9600087983a7
😎 Deploy Preview https://deploy-preview-4087--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new bot commented Apr 3, 2026

Open in StackBlitz

npm i https://pkg.pr.new/@electric-sql/react@4087
npm i https://pkg.pr.new/@electric-sql/client@4087
npm i https://pkg.pr.new/@electric-sql/y-electric@4087

commit: 424fe62

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.78%. Comparing base (b449f70) to head (424fe62).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4087   +/-   ##
=======================================
  Coverage   88.78%   88.78%           
=======================================
  Files          25       25           
  Lines        2452     2471   +19     
  Branches      615      627   +12     
=======================================
+ Hits         2177     2194   +17     
- Misses        273      275    +2     
  Partials        2        2           
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 93.86% <100.00%> (-0.05%) ⬇️
packages/y-electric 56.05% <ø> (ø)
typescript 88.78% <100.00%> (+<0.01%) ⬆️
unit-tests 88.78% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

KyleAMathews and others added 3 commits April 3, 2026 09:16
…test

The test waited for fast-loop detection to error, but the exponential
backoff (100ms-5s across 5 detections) takes longer than the timeout
in CI. Simplified to verify self-healing fires and the entry is cleared
— the fast-loop error path is already tested in stream.test.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The #expiredShapeRecoveryKey guard was only cleared in #onMessages when
an up-to-date batch arrived. The 204 backward-compatibility path
transitions directly to LiveState without going through #onMessages
(empty body → batch.length === 0 → early return), leaving the guard
stuck. This prevented a second self-healing cycle on the same stream
instance.

Clear the guard in #onInitialResponse when the response transitions
directly to live (action=accepted, state=live), covering the 204 path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…test

The caughtError===null assertion was environment-sensitive: the fast-loop
detector's 500ms window can catch more requests on slower machines,
firing a 502 that's orthogonal to the recovery guard bug being tested.

The precise signal is selfHealCount===2: if the guard is stuck, the
code throws 502 *before* incrementing, so selfHealCount stays at 1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@balegas
Copy link
Copy Markdown
Contributor

balegas commented Apr 10, 2026

Code Review

Overview

The core fix is correct. I traced the stale-retry → self-healing flow end-to-end and verified body cancellation, guard clearing race-freedom, and reset ordering all hold. SPEC.md S0 is the right place to anchor the handle-uniqueness assumption the mechanism depends on.

Suggestions

1. The 204 guard-clearing check is looser than the comment suggests (client.ts:1244-1246)

if (transition.action === `accepted` && this.#syncState.kind === `live`) {
  this.#expiredShapeRecoveryKey = null
}

The comment says "e.g., 204" but the check fires whenever accepted + state becomes live. It works today only because the state machine doesn't transition to live at response-metadata stage for 200s — if that assumption ever changes, this will start clearing the guard prematurely. Consider either:

  • Narrowing to an explicit status === 204 check (matches the comment's intent), or
  • Adding a state-machine assertion/test to lock in the assumption

Not blocking — defensive hardening.

2. #reset() comment is now inaccurate (client.ts:1770)

The comment on #reset() says it's called from within the running stream loop (#requestShape's 409 handler) — self-healing is now a second caller. A future edit that relies on "only 409 handler" semantics could break self-healing. One-line fix:

it's called from within the running stream loop (#requestShape's 409 handler
or stale-retry self-healing)

3. Self-healing silently accepts stale data

The "CDN always returns stale handle" case documents that after self-healing clears the expired entry, the client happily accepts the same stale response. The trade-off (stale data > permanent 502) is the right call, but there's no post-hoc signal to the user that they're on stale data. Consider logging a follow-up warning when a post-self-healing response arrives with the handle that was just marked-and-cleared. Low priority — could be a separate PR if telemetry becomes a need.

Test concerns

1. Timing-sensitive setTimeout waits (200ms, 500ms, 1000ms)

The test history (1f7a5f4, 3c90f21) shows the author already fought flakiness here. Consider adopting the deterministic FetchGate pattern from #4089 once that lands — turn-based gating eliminates all timing races. The two PRs are complementary.

2. The 204-path test is doing too much (should clear recovery guard after 204 so self-healing works again)

105 lines of mock logic with multi-phase state (selfHealCount, healingDone, mid-test markExpired, manual aborter.abort()), and asserts only selfHealCount === 2 because the fast-loop detector may fire on slower machines. That's a signal the test is conflating concerns. Consider splitting:

  • Unit test: "204 clears recovery guard" against #onInitialResponse directly
  • Integration test: "Self-healing can fire multiple times after recovery guard clears"

3. markExpired mutation inside fetchMock.mockImplementation (test/expired-shapes-cache.test.ts:418)

Mutating expiredShapesCache from inside the fetch mock couples the test to an event ordering that's not obvious to future readers. At minimum, a // Simulate a second 409 happening mid-stream comment; ideally, restructure so the mutation happens outside the mock callback.

Recommendation

Approve with minor suggestions. Fix #1 (tighten the 204 check) and #2 (comment update) are cheap safety wins worth doing before merge. The rest are non-blocking.

KyleAMathews and others added 6 commits April 10, 2026 10:29
Addresses three findings from external review of the self-healing PR:

1. Detect and warn when a post-self-heal response carries the same
   handle we just marked expired. Previously the client silently
   accepted stale data with no operator signal — now it emits a
   targeted warning naming the handle and pointing at the proxy
   cache-key misconfiguration that causes this.

2. Tighten the recovery-guard clearing check from
   `accepted + kind === live` to an explicit `status === 204`, matching
   the comment's intent and removing latent fragility if the state
   machine ever starts transitioning to live for non-204 responses.

3. Update the `#reset()` comment to list all three callers
   (#requestShape's 409 handler, #checkFastLoop, and stale-retry
   self-healing) instead of only the 409 handler.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two orthogonal CI fixes for the expired-shapes-cache test suite:

1. Silence stderr noise: many tests in this suite intentionally trigger
   stale-cache scenarios that produce expected console.warn output.
   Install a shared beforeEach spy that mocks console.warn for all
   tests (tests that need to assert on warnings still can, via
   warnSpy.mock.calls).

2. Prevent unhandled FetchError(502) from fast-loop detector:
   - "CDN always returns stale handle" test: add onError handler and
     poll until self-heal fires, then abort explicitly so the stream
     can't loop in the background until #checkFastLoop throws.
   - "204 recovery guard" test: add onError handler so the fast-loop
     detector (which the test's own comment acknowledges may race
     on slow runners) can't leak as an unhandled rejection.

Both tests still assert the same signals — this change affects test
plumbing only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Package-local prettier 3.3.3 and root prettier 3.6.2 format markdown
blank-lines-before-lists differently. CI runs root prettier (strips the
line); the pre-commit hook picks up package-local prettier (re-adds it),
causing lint-staged to think the commit is empty when trying to fix CI.

Changesets auto-generates CHANGELOG files, so formatting them manually
isn't needed anyway.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Package-local prettier 3.3.3 formats markdown blank-lines-before-lists
differently than root prettier 3.6.2. CI runs root prettier; the pre-commit
hook picks up package-local prettier — they disagreed, causing lint-staged
to think every changelog fix was an empty commit.

- Bumped prettier to ^3.6.2 in experimental, react-hooks, start,
  typescript-client, y-electric, burn/assets, and redis packages
  (aligning with root 3.6.2; identified via sherif).
- Reverted the temporary CHANGELOG.md entry in .prettierignore now that
  the root cause is fixed.
- Reformatted both package CHANGELOG files with 3.6.2 to match CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Using bare "prettier --write" in lint-staged lets nano-spawn's PATH
resolution pick up globally-installed prettier (e.g. ~/.bun/bin/prettier
3.4.2) over the workspace's 3.6.2. That global version disagrees with CI
on markdown list formatting, causing the pre-commit hook to silently
revert the fix and then fail with "empty commit". Pointing explicitly at
node_modules/.bin/prettier guarantees lint-staged uses exactly the
version pnpm installed.

Also reformats the two CHANGELOGs that CI's format:check flagged.
@KyleAMathews KyleAMathews merged commit 690e25a into main Apr 10, 2026
43 checks passed
@KyleAMathews KyleAMathews deleted the pr-4085 branch April 10, 2026 18:22
@github-actions
Copy link
Copy Markdown
Contributor

This PR has been released! 🚀

The following packages include changes from this PR:

  • @electric-sql/client@1.5.15

Thanks for contributing to Electric!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants